EuReCo - Joining Forces for a European Reference Corpus as a sustainable base for cross-linguistic research

Marc Kupietz¹, Andreas Witt¹,²,³, Piotr Bański¹, Dan Tufiş⁴, Dan Cristea⁵,⁶, Tamás Váradi⁷

1 Institut für Deutsche Sprache, Mannheim
2 University of Cologne, Faculty of Arts and Humanities
3 Heidelberg University, Department of Computational Linguistics
4 Institute for Artificial Intelligence “Mihai Drăgănescu”, Bucharest
5 Romanian Academy, Institute for Computer Science - Iaşi
6 “Alexandru Ioan Cuza” University of Iaşi, Department of Computer Science
7 Research Institute for Linguistics, Hungarian Academy of Sciences

(kupietz|witt|banski)@ids-mannheim.de, tufis@racai.ro, dcristea@info.uaic.ro, varadi.tamas@nytud.mta.hu

Abstract

In this paper we discuss the opportunities, prerequisites, possible applications and implications of a virtually joint corpus based on existing national, reference and other large corpora and their responsible institutions.

1 Introduction

The past 20 years have seen the emergence of national, reference and other large corpora of numerous European languages (Aston & Burnard, 1998; Váradi, 2002; CNC, 2005; Geyken, 2007; Baroni et al., 2009; Davies, 2010; Kupietz et al., 2010; Przepiórkowski et al., 2010; Oravecz et al., 2014; Tufiş et al., 2016). Most of them have been or are being built in projects of limited duration, but are typically based at institutions that are at least to some degree responsible for curating the data and for making it available to the respective scientific communities after the building phase as well. The idea of EuReCo, which has been around in the CMLC workshop series since 2012 (see Bański et al., 2012), is that such institutions, rather than continuing as “research islands”, should join forces and explore whether a well-designed technology could allow a unifying view on the building and exploitation of a multilingual collection of comparable corpora, a goal motivated by the rapidly changing and growing variety of needs of the linguistic and related user communities.

In this paper we present such a joint project, called EuReCo, briefly showing its aims, the technology behind it and the language resources involved.

2 Aims

2.1 Comparable corpora

One of the aforementioned growing needs is the need for comparable corpora in order to facilitate contrastive and, more generally, cross-linguistic research beyond the possibilities provided by parallel corpora, which are very much limited for linguistic applications by unavoidable translation biases. This application area is also the initial and currently the main focus of EuReCo. Joining forces in this area appears to be a particularly promising prospect: several national and reference corpora are built and maintained anyway and independently. With methodologies and techniques developed for joining them virtually, where each national centre is still responsible for its language and each corpus is still physically located at its centre, this should be much more economical, scalable and sustainable than to create comparable corpora from scratch, possibly even at more than one centre.

2.2 Further aims

In the meantime, however, the envisioned EuReCo has acquired a broader range of potential applications: if the organisational and technical prerequisites for such an infrastructure prove to be feasible, it would be wise to identify, as early as possible, further functionalities that are required by the collaboration partners now or in the near future, for example the ability to manage very large corpora, statistical analysis (ideally offered dynamically to the user), and support for querying different kinds of linguistic annotations.

The general goal of the EuReCo initiative is to bring together existing European corpus initiatives, specifically in those areas where synergy effects can be expected with high certainty and in a very much target-oriented fashion, towards goals that the collaboration partners would like to achieve, but are unlikely to achieve alone in a sufficiently effective and sustainable way.

Apart from these rather economic aspects, EuReCo also expects benefits from bringing closer together research communities that are currently centered around philologies and their sub-disciplines.

2.3 Relation to CLARIN

Thus, the EuReCo objectives are much narrower and more oriented towards target applications than those of the European Language Resource Infrastructure Project CLARIN, which “makes digital language resources available to scholars, researchers, students and citizen-scientists from all disciplines, especially in the humanities and social sciences, through single sign-on access.” (see www.clarin.eu). In contrast to CLARIN, which has been particularly strong at providing horizontal base layers of infrastructure, standards and best practices, EuReCo will typically aim at vertical columns ending directly at end-user applications.

3 Foundations

Despite the differing scope and objectives, EuReCo will, of course, necessarily be tightly integrated into CLARIN. In addition, its roots lie in a number of experiences gathered by its collaboration partners and their respective histories of providing corpora and tools for using them, in large part within CLARIN:

• contemporary corpora are always tied to their hosting organizations by license contracts and other legal restrictions (Kupietz et al., 2014),
• the way linguists use corpus data is itself subject to rapidly developing research,
• the exact requirements of corpus search and analysis tools for different corpora differ with research traditions and target communities,
• there will be no single tool that satisfies all user needs,
• unification is often necessary to keep costs manageable and to allow for re-usability, but one has to be very careful to keep the results usable and useful.

Based on these insights, the EuReCo strategy can be characterized by the following key properties:

• the aims of the collaboration have to be carefully picked and outlined in order to guarantee that:
  – the product of the collaboration is actually useful for the collaboration partners and their research communities,
  – the overhead of the collaboration does not outweigh its synergy effects,
• commonly developed and used tools must acknowledge the fact that the corpus data itself may not leave its hosting organizations,
• the collaboratively developed tools will usually not replace, but only complement those already existing.

4 Previous and current work

4.1 KorAP

The current main technical basis for EuReCo is the corpus query and analysis platform KorAP, which has recently been developed at the IDS (Bański et al., 2013; 2014; Diewald et al., 2016). KorAP is the designated successor of the corpus search and management system COSMAS, which was launched in 1994 and which, in its second incarnation (COSMAS II), is still used by 39,000 researchers working on the German language. Besides KorAP’s more performance-oriented features, such as horizontal scalability with respect to an unbounded corpus size and any number of annotation layers, two are particularly fundamental for EuReCo: 1.) its ability to manage corpora that are physically located at different places, in order to comply with typical license restrictions (cf. Kupietz et al., 2014), and 2.) its ability to dynamically create virtual sub-corpora based on text properties and to manage these virtual corpora in a persistent way, e.g. to allow for reusability and reproducibility. In addition, using a micro-service-like architecture, KorAP has been specifically designed for collaborative development and particularly collaborative extensibility up to the end-user. Extensibility is also KorAP’s main approach to Jim Gray’s famous postulate “put the computation near the data”, which is essential not only to cope with big data, but also to cope with intellectual property rights (IPR) restrictions.

4.2 CoRoLa

This is a priority project of the Romanian Academy, carried out by the Institute for Artificial Intelligence “Mihai Drăgănescu” in Bucharest and the Institute for Computer Science in Iaşi, both affiliated with the Romanian Academy. When finalised (end of 2017), CoRoLa will be the largest corpus of contemporary Romanian, including both written and spoken data. The distinctive aspect of the CoRoLa project (Tufiş et al., 2016) is that all the data included in the reference corpus have cleared IPR, based on bilateral agreements between the developing institutions and the data providers. The migration of CoRoLa data to the new DRuKoLA environment involves new encoding and indexing methods, mapping of annotations, etc., so that users can use all the facilities of the KorAP query platform.

4.3 The Hungarian corpus

The Hungarian National Corpus is a balanced reference corpus intended to capture varieties of present-day Hungarian in five selected major genres: journalism, literature, (popular science), personal, and official language use. Its first version appeared in 2001 and contained 187 million running words, morphologically annotated and tagged. The majority of the data were collected from electronic sources from within Hungary, but the HNC also contains subcorpora representing Hungarian as a minority language spoken in the neighbouring countries. For further details on the design and implementation of the first release of the corpus, see Váradi (2002).

The HNC has recently been substantially upgraded and extended to gigaword size. This new release followed the original design of the corpus, but the internal proportions of the genres have been changed, mainly to do justice to the ubiquitous social media. The annotation has also been overhauled, and the engine and the user interface have been modernised, employing the Manatee/Bonito framework (Rychlý, 2007). Oravecz et al. (2014) describe the corpus in more detail.

4.4 DRuKoLA

Parts of the EuReCo vision have already been implemented in the DRuKoLA project¹, large parts of which can also be regarded as a pilot study for EuReCo (Cosma et al., 2016). DRuKoLA is centered around the German Reference Corpus DeReKo (Kupietz et al., 2010) and the Reference Corpus of Contemporary Romanian Language CoRoLa (Tufiş et al., 2015). One of its main objectives is to provide a common platform for constructing various kinds of comparable corpora based on text properties, and for analysing them for contrastive linguistic purposes.

The present state of the EuReCo-relevant part of DRuKoLA is that a converter from the CoRoLa TEI format to the KorAP XML format has been implemented, so that CoRoLa can now be accessed via KorAP. At present, a large part (60%, ~300 million words) of the textual content of CoRoLa has been incorporated as the Romanian part of the DRuKoLA content. The next step will be to fine-tune a first version of mapping functions from CoRoLa and DeReKo metadata categories to intermediate taxonomies, on the basis of which virtual corpora will be dynamically generated. It seems that intermediate taxonomies for topic domains and text types will typically be necessary to arrive at sufficiently valid and fine-grained common category systems.

The Romanian speech data (collected in CoRoLa) will be added to DRuKoLA when the appropriate processing functionality of KorAP is finalized.

4.5 DeutUng

As a second EuReCo pilot project, DeutUng² will start to integrate the Hungarian National Corpus (HNC) into EuReCo. With respect to the establishment of an infrastructure and research methodology for comparable corpora, DeutUng is similar to DRuKoLA³.

¹ DRuKoLA (2016-2019) is funded by the Alexander von Humboldt-Foundation, as a Research Group Linkage Programme, between the University of Bucharest and the Institute for the German Language in Mannheim, with the Institute for Artificial Intelligence Mihai Drăgănescu (RACAI, Bucharest) and the Institute for Computer Science (IIT, Iaşi) of the Romanian Academy as associated partners. The acronym combines central goals of the project: corpus development and contrastive linguistic analysis (Sprachvergleich korpustechnologisch Deutsch - Rumänisch).
² DeutUng (2017-2020) is a co-operation project between IDS Mannheim and the University of Szeged with the Research Institute for Linguistics at the Hungarian Academy of Sciences as associated partner. It is also funded by the Alexander von Humboldt-Foundation as a Research Group Linkage Programme.
³ With respect to linguistic application, however, DeutUng has an additional focus on second language acquisition.

5 Conclusions

The EuReCo initiative represents an ambitious effort to build a self-sustainable and flexible basis for comparable corpora, which is expected to offer very attractive opportunities for users but also challenges for developers. Multilinguality, which is at the root of the idea of EuReCo, together with the vast repositories of language data, requires innovative and robust technical solutions. The co-operation of several institutions and expert groups, as envisaged by EuReCo, promises to open new research avenues in the European digital humanities. Moreover, the technical base developed in EuReCo will provide support for innovative experiments that involve linguistic resources of different types and their interconnection. Showing that a commonly agreed methodology can provide unified access to very diverse basic-level linguistic representations could provide useful insights into linking and unifying the access to diverse types of linguistic data (corpora, dictionaries, wordnets, etc.).

6 References

Aston, G. and Burnard, L. (1998). The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.

Bański, P., Kupietz, M., Witt, A., Ćavar, D., Heiden, S., Aristar, A. and H. Aristar-Dry (eds.) (2012). Proceedings of the LREC-2012 workshop on “Challenges in the management of large corpora” (CMLC-1). Istanbul / Paris: ELRA.

Bański, P., Bingel, J., Diewald, N., Frick, E., Hanl, M., Kupietz, M., Pęzik, P., Schnober, C. and Witt, A. (2013). KorAP: the new corpus analysis platform at IDS Mannheim. In: Vetulani, Z. and Uszkoreit, H. (eds.): Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference. Poznań: Fundacja Uniwersytetu im. A., 2013: 586-587.

Bański, P., Diewald, N., Hanl, M., Kupietz, M. and A. Witt (2014). Access Control by Query Rewriting: the Case of KorAP. In: Proceedings of the 9th conference on the Language Resources and Evaluation Conference (LREC 2014), European Language Resources Association (ELRA), Reykjavik, Iceland, May 2014: 3817-3822.

Baroni, M., Bernardini, S., Ferraresi, A., and E. Zanchetta (2009). The WaCky Wide Web: A collection of very large linguistically processed Web-crawled corpora. Language Resources and Evaluation 3/2009: 209-226.

CNC (2005). Czech National Corpus – SYN2005. Institute of Czech National Corpus, Faculty of Arts, Charles University, Prague, Czech Republic.

Cosma, R., Cristea, D., Kupietz, M., Tufiş, D. and A. Witt (2016). DRuKoLA – Towards Contrastive German-Romanian Research based on Comparable Corpora. In: Bański, P., Barbaresi, A., Biber, H., Breiteneder, E., Clematide, S., Kupietz, M., Lüngen, H., and A. Witt: 4th Workshop on Challenges in the Management of Large Corpora. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož / Paris: ELRA: 28-32.

Davies, M. (2010). The Corpus of Contemporary American English as the first reliable monitor corpus of English. Lit Linguist Computing (2010) 25(4): 447-464.

Diewald, N., Hanl, M., Margaretha, E., Bingel, J., Kupietz, M., Bański, P. and A. Witt (2016). KorAP Architecture – Diving in the Deep Sea of Corpus Data. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz / Paris: ELRA: 3586-3591.

Geyken, A. (2007). The DWDS corpus: A reference corpus for the German language of the 20th century. Collocations and Idioms, London: 23–40.

Kupietz, M., Belica, C., Keibel, H. and Witt, A. (2010). The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10): 1848-1854.

Kupietz, M., Lüngen, H., Bański, P. and Belica, C. (2014). Maximizing the Potential of Very Large Corpora. In: Kupietz, M., Biber, H., Lüngen, H., Bański, P., Breiteneder, E., Mörth, K., Witt, A., Takhsha, J. (eds.): Proceedings of the LREC-2014-Workshop Challenges in the Management of Large Corpora (CMLC2). Reykjavik / Paris: ELRA: 1–6.

Oravecz, Cs., Váradi, T. and Sass, B. (2014). The Hungarian Gigaword Corpus. In: Calzolari, Nicoletta et al. (eds.): Proceedings on the Ninth International Conference in Language Resources and Evaluation (LREC’14). Reykjavik / Paris: ELRA: 1719–1723.

Przepiórkowski, A., Górski, R. L., Łaziński, M. and P. Pęzik (2010). Recent Developments in the National Corpus of Polish. In: Calzolari, N. et al. (eds.): Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10). Paris: ELRA.

Rychlý, P. (2007). Manatee/bonito – a modular corpus manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Brno, Czech Republic: Masaryk University: 65–70.

Tufiş, D., Barbu Mititelu, V., Irimia, E., Dumitrescu, Ş. D., Boroş, T., Teodorescu, N. H., Cristea, D., Scutelnicu, A., Bolea, C., Moruz, A. and L. Pistol (2015). CoRoLa Starts Blooming – An Update on the Reference Corpus of Contemporary Romanian Language. In: Proceedings of the 3rd Workshop on Challenges in the Management of Large Corpora (CMLC-3). Mannheim: IDS: 5-10.

Tufiş, D., Barbu Mititelu, V., Irimia, E., Dumitrescu, Ş. D., Boroş, T. (2016). The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portoroz / Paris: ELRA.

Váradi, T. (2002). The Hungarian National Corpus. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas / Paris: ELRA: 385–389.